Feature/skills aggregator: add SkillEvalAggregator for batch evaluation comparisons #229

Draft
venkatkrish543re wants to merge 9 commits into strands-agents:main from venkatkrish543re:feature/skills-aggregator

venkatkrish543re commented May 14, 2026

No description provided.

ybdarrenwang and others added 9 commits May 6, 2026 19:21
Adds a skills/ subpackage that provides paired-comparison aggregation for evaluating agent skills against a baseline. Mirrors the chaos aggregator pattern from feature/aggregator-demo-2.

- SkillEvalAggregator with Wilcoxon, paired-t, and McNemar tests

- Bootstrap CI on the mean delta (1000 resamples)

- Corruption filtering before paired statistics

- SkillEvalExperiment composes base Experiment

- Rich-based interactive display

- 44 unit tests covering paired stats, corruption filtering, pairing, and serialization
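
For readers unfamiliar with the statistics listed above, here is a minimal standard-library sketch of three of them: corruption filtering, a percentile bootstrap CI on the mean delta, and an exact McNemar test. It is not the code in this PR. The function names, the definition of a "corrupted" pair (a missing or NaN score), the percentile method, and the two-sided exact form of McNemar's test are all assumptions made for illustration.

```python
import math
import random
from statistics import mean


def filter_corrupted(pairs):
    """Drop (baseline, skill) pairs where either score is None or NaN.

    Assumption: this is what 'corrupted' means; the PR may define it differently.
    """
    def ok(x):
        return x is not None and not math.isnan(x)

    return [(b, s) for b, s in pairs if ok(b) and ok(s)]


def bootstrap_mean_delta_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(skill - baseline).

    Deltas are resampled with replacement, so each pair stays intact.
    """
    rng = random.Random(seed)
    deltas = [s - b for b, s in pairs]
    n = len(deltas)
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), lo, hi


def mcnemar_exact(baseline_pass, skill_pass):
    """Two-sided exact (binomial) McNemar p-value for paired pass/fail outcomes."""
    outcomes = list(zip(baseline_pass, skill_pass))
    b = sum(1 for x, y in outcomes if x and not y)  # baseline passed, skill failed
    c = sum(1 for x, y in outcomes if not x and y)  # baseline failed, skill passed
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    raw = [(0.5, 0.7), (0.4, 0.6), (None, 0.9), (0.6, float("nan")), (0.3, 0.5)]
    clean = filter_corrupted(raw)
    print(len(clean), bootstrap_mean_delta_ci(clean))
    print(mcnemar_exact([True, False, False, False, True],
                        [True, True, True, True, False]))
```

Filtering comes first so that a corrupted run on either side removes the whole pair, which keeps the baseline and skill samples aligned for the paired statistics.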

Closes strands-agents#228
yonib05 added the area-evaluators (Evaluators: output, trajectory, tool use, interactions, and LLM-as-judge quality metrics) and enhancement (New feature or request) labels Jun 11, 2026
3 participants